<!DOCTYPE html>

<html>

  <head>
    <title>Ch. 11 - Reinforcement Learning</title>
    <meta name="Ch. 11 - Reinforcement Learning" content="text/html; charset=utf-8;" />
    <link rel="canonical" href="http://manipulation.csail.mit.edu/rl.html" />

    <script src="https://hypothes.is/embed.js" async></script>
    <script type="text/javascript" src="chapters.js"></script>
    <script type="text/javascript" src="htmlbook/book.js"></script>

    <script src="htmlbook/mathjax-config.js" defer></script> 
    <script type="text/javascript" id="MathJax-script" defer
      src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js">
    </script>
    <script>window.MathJax || document.write('<script type="text/javascript" src="htmlbook/MathJax/es5/tex-chtml.js" defer><\/script>')</script>

    <link rel="stylesheet" href="htmlbook/highlight/styles/default.css">
    <script src="htmlbook/highlight/highlight.pack.js"></script> <!-- http://highlightjs.readthedocs.io/en/latest/css-classes-reference.html#language-names-and-aliases -->
    <script>hljs.initHighlightingOnLoad();</script>

    <link rel="stylesheet" type="text/css" href="htmlbook/book.css" />
  </head>

<body onload="loadChapter('manipulation');">

<div data-type="titlepage" pdf="no">
  <header>
    <h1><a href="index.html" style="text-decoration:none;">Robotic Manipulation</a></h1>
    <p data-type="subtitle">Perception, Planning, and Control</p> 
    <p style="font-size: 18px;"><a href="http://people.csail.mit.edu/russt/">Russ Tedrake</a></p>
    <p style="font-size: 14px; text-align: right;"> 
      &copy; Russ Tedrake, 2020-2023<br/>
      Last modified <span id="last_modified"></span>.</br>
      <script>
      var d = new Date(document.lastModified);
      document.getElementById("last_modified").innerHTML = d.getFullYear() + "-" + (d.getMonth()+1) + "-" + d.getDate();</script>
      <a href="misc.html">How to cite these notes, use annotations, and give feedback.</a><br/>
    </p>
  </header>
</div>

<p pdf="no"><b>Note:</b> These are working notes used for <a
href="http://manipulation.csail.mit.edu/Fall2023/">a course being taught
at MIT</a>. They will be updated throughout the Fall 2023 semester.  <!-- <a 
href="https://www.youtube.com/channel/UChfUOAhz7ynELF-s_1LPpWg">Lecture  videos are available on YouTube</a>.--></p> 

<table style="width:100%;" pdf="no"><tr style="width:100%">
  <td style="width:33%;text-align:left;"><a class="previous_chapter" href=deep_perception.html>Previous Chapter</a></td>
  <td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
  <td style="width:33%;text-align:right;"><a class="next_chapter" href=drake.html>Next Chapter</a></td>
</tr></table>

<script type="text/javascript">document.write(notebook_header('rl'))
</script>
<!-- EVERYTHING ABOVE THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<chapter style="counter-reset: chapter 10"><h1>Reinforcement Learning</h1>
  <div style="clear:right;"></div>

  <p>These days, there is a lot of excitement around reinforcement learning
  (RL), and a lot of literature available.  The scope of what one might
  consider to be a reinforcement learning algorithm has also broaden
  significantly. The classic (and now updated) and still best introduction to
  RL is the book by Sutton and Barto <elib>Sutton18</elib> .  For a more
  theoretical view, try <elib>Agarwal20b+Szepesvari10</elib>.  There
  are some great online courses, as well: <a
  href="http://web.stanford.edu/class/cs234/">Stanford CS234</a>, <a
  href="https://rail.eecs.berkeley.edu/deeprlcourse/">Berkeley CS285</a>, <a
  href="https://deepmind.com/learning-resources/-introduction-reinforcement-learning-david-silver">DeepMind
  x UCL</a>.</p>

  <p>My goal for this chapter is to provide enough of the fundamentals to get
  us all on the same page, but to focus primarily on the RL ideas (and
  examples) that are particularly relevant for manipulation.  And manipulation
  is a great playground for RL, due to the need for rich perception, for
  operating in diverse environments, and potentially with rich contact
  mechanics.  Many of the core applied research results have been motivated by
  and demonstrated on manipulation examples!</p>

  <section><h1>RL Software</h1>
  
    <p>There are now a huge variety of RL toolboxes available online, with
    widely varying levels of quality and sophistication.  But there is one
    standard that has clearly won out as the default interface with which one
    should wrap up their simulator in order to connect to a variety of RL
    algorithms: the <a href="https://gymnasium.farama.org/">Gymnasium</a>
    (which is often still called "Gym" for short).</p>

    <p>It's worth taking a minute to appreciate the difference in the 
    Gymnasium <a href="https://gymnasium.farama.org/api/env/">Environments</a>
    (<code>gymnasium.Env</code>) interface and the Drake System interface; I
    think it is very telling.  My goal (in Drake), is to present you with a
    rich and beautiful interface to express and optimize dynamical systems, to
    expose and exploit all possible structure in the governing equations.  The
    goal in Gym is to expose the absolute minimal details, so that it's
    possible to easily wrap every possible system under the same common
    interface (it doesn't matter if it's a robot, an Atari game, or even a <a
    href="https://github.com/facebookresearch/CompilerGym">compiler</a>).
    Almost by definition, you can wrap any Drake system as a Gym
    environment.</p>  

    <example id="drake_gym">
      <h1>Using a Drake simulation as an Gym environment</h1>
    
      <p>A Gym environment is an incredibly simple wrapper around simulators
      which offers a <a
      href="https://github.com/openai/gym/blob/6a04d49722724677610e36c1f92908e72f51da0c/gym/core.py#L35">very
      basic interface</a>, most notably consisting of <code>reset()</code>,
      <code>step()</code>, <code>render()</code>. The
      <code>step()</code> method returns the current observations and the
      one-step reward (as well as some additional termination conditions).
      </p>

      <p>You can wrap any Drake simulation in an OpenAI gym environment, using 
      <pre><code class="python">from pydrake.gym import DrakeGymEnv</code></pre>
      The <a
      href="https://drake.mit.edu/pydrake/pydrake.gym.html?highlight=drakegym#pydrake.gym.DrakeGymEnv"><code>DrakeGymEnv</code>
      constructor</a> takes a <code>Simulator</code> as well as an input port
      to associate with the actions, an output port to associate with the
      observations, etc. For the reward, you can implement it as a simple function of the <code>Simulator</code>
      <code>Context</code>, or as another output port.
      </p>

      <p><code>DrakeGymEnv</code> is built around a <code>Simulator</code>
      (not just a <code>System</code>) or a function that produces a random
      <code>Simulator</code> because you might want to control the integrator
      parameters or have each rollout contain the same robot in a different
      environment, with potentially different numbers of objects. This would
      mean that the underlying <code>System</code> might have a different
      number of states / ports. The notion of a function that can produce
      simulators, referred to as a <code>SimulatorFactory</code>, is core to
      the <a
      href="https://drake.mit.edu/doxygen_cxx/group__stochastic__systems.html">stochastic
      system modeling framework</a> in Drake.</p>
    
    </example>
    
    <p>You can also use any Gym environment in the Drake ecosystem; you just
    won't be able to apply some of the more advanced algorithms that Drake
    provides.  Of course, I think that you should use Drake for your work in
    RL, too (many people do), because it provides a the rich library of
    dynamical systems that are rigorously authored and tested, including a
    great physics engine for dealing with contact, and leaves open the option
    to put RL approaches head-to-head against more model-based alternatives. I
    admit I might be a little biased. At any rate, that's the approach we will
    take in these notes.</p>
    
    <p>Some people <a
    href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">might
    argue</a> that the more thoughtfully you model your system, the more
    assumptions you have baked in, making yourself susceptible to "sim2real"
    gaps; but I think that's simply not the case.  Thoughtful modeling includes
    making uncertainty models that can account for as narrow or broad of a
    class of systems as we aim to explore; good things happen when we can make
    the uncertainty models themselves <i>structured</i>. I think one of the
    most fundamental challenges waiting for us at the intersection of
    reinforcement learning and control is a deeper understanding of the class
    of models that is rich enough to describe the diversity and richness of
    problems we are exploring in manipulation (e.g. with RGB cameras as inputs)
    while providing somewhat more structure that we can exploit with stronger
    algorithms.  Importantly, these models should continually expand and
    improve with data.</p>  <todo>callout to intuitive physics chapter once it
    exists</todo>

    <!-- another point: the RL approach has been successful in part because
    it's easy.  approachable with less training, and as a result got more
    people/funding put against it. that's super valuable.  controls as a field,
    on the other hand, has made itself somewhat intentionally obscure, perhaps
    partly out of arrogance.  that is perhaps the more bitter lesson.  -->

    <p>Gymnasium provides an interface for RL environments, but doesn't
    provide the implementation of the actual RL algorithms.  There are a large
    number of popular repositories for the algorithms, too.  As of this
    writing, I would recommend <a
    href="https://stable-baselines3.readthedocs.io/en/master/index.html">Stable
    Baselines 3</a>: it provides a very nice and thoughtfully-documented set of
    implementations in PyTorch.</p>

    <!--
      - the env interface is now standard, but the policy interface is not.
        everyone rolls their own, and mostly assumes nn.Module as a base class.
      - SB3 lets you define a custom "policy" (actually policy + value func),
        but doesn't currently support dynamic policies (LSTM was removed at
        some point; hopefully it will come back)
      - I really dislike that the notion of value function and Q function are
        applied directly to observations.  (In SB3, but most other packages,
        too).  That's bad form.
      -->


    <p>One other class of algorithms that is very relevant to RL but not
    specifically designed for RL is algorithms for black-box optimization.  I
    quite like <a
    href="https://facebookresearch.github.io/nevergrad/">Nevergrad</a>, and
    will also use that here.</p>

    <!-- We have some relevant nevergrad <=> drake (e.g. mathematicalprogram)
        code in anzu. -->
  
  </section>

  <section><h1>Policy-gradient methods</h1>

    <!-- Note: I have code for the plate pickup in
    Dropbox/notebooks/low_dof_plate_pick_experimentation and in Google
    Drive/Colab notebook/Plate pickup.ipynb -->

    <subsection><h1>Black-box optimization</h1>

      <p></p>
  
      <example id="nevergrad"><h1>Black-box optimization.</h1>
    
        Coming to deepnote soon (I have to handle the dependencies). The example is available in <code>rl/black_box.ipynb</code>.
        <!--
        <script>document.write(notebook_link('rl', deepnote, '', 'black_box'))
        </script>
        -->
      </example>

    </subsection>
  
    <subsection><h1>Stochastic optimal control</h1>
    
    </subsection>

    <subsection><h1>Using gradients of the policy, but not the environment</h1>
    
      <p>You can find more details on the derivation and some basic analysis of
      these algorithms <a
      href="http://underactuated.csail.mit.edu/rl_policy_search.html">here</a>.</p>
    
    </subsection>

    <subsection><h1>REINFORCE, PPO, TRPO</h1>
    
      <todo>Example of each of these algorithms on six-hump camel (with trivial environment/policy).</todo>

      <example id="box_flipup"><h1>Flipping up the box.</h1>
    
        Coming to deepnote soon (I have to handle the dependencies). The example is available in <code>rl/box_flipup.ipynb</code>.
        <!--
        <script>document.write(notebook_link('rl', deepnote, '', 'box_flipup'))</script>
        -->
  
      </example>

    </subsection>

    <subsection><h1>Control for manipulation should be easy</h1>
    
      <p>This is a great time for theoretical RL + controls, with experts from
      controls embracing new techniques and insights from machine learning, and
      vice versa.  As a simple example, we've increasingly come to understand
      that, even though the cost landscape for many classical control problems
      (like the linear quadratic regulator) is not convex in the typical policy
      parameters, we <a
      href="http://underactuated.csail.mit.edu/policy_search.html">now
      understand that gradient descent still works</a> for these problems
      (there are no local minima), and the class of problems/parameterizations
      for which we can make statements like this is growing rapidly.</p>
    </subsection>


  </section>

  <section><h1>Value-based methods</h1>
  
  
  
  </section>

  <section><h1>Model-based RL</h1>
  
    <todo>MPC with learned models, typically represented with a neural network.</todo>

  </section>

  <section><h1>Exercises</h1>
    <exercise><h1>Stochastic Optimization</h1>
      <p> For this exercise, you will implement a stochastic optimization scheme that does not require exact analytical gradients. You will work exclusively in <script>document.write(notebook_link('rl', d=deepnote, link_text='this notebook', notebook='stochastic_optimization'))</script>. You will be asked to complete the following steps: </p>
      <ol type="a">
        <li> Implement gradient descent with exact analytical gradients.
        </li>
        <li> Implement stochastic gradient descent with approximated gradients. </li>
        <li> Prove that the expected value of the stochastic update does not change with baselines.</li>
        <li> Implement stochastic gradient descent with baselines. 
        </li>
      </ol>
    </exercise>    

    <exercise><h1>REINFORCE</h1>
      <p> For this exercise, you will implement the vanilla REINFORCE algorithm on a box pushing task. You will work exclusively in <script>document.write(notebook_link('rl', d=deepnote, link_text='this notebook', notebook='policy_gradient'))</script>. You will be asked to complete the following steps: </p>
      <ol type="a">
        <li> Implement the policy loss function. </li>
        <li> Implement the value loss function. </li>
        <li> Implement the advantage function. </li>
      </ol>
    </exercise>
  </section>

</chapter>
<!-- EVERYTHING BELOW THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->

<div id="references"><section><h1>References</h1>
<ol>

<li id=Sutton18>
<span class="author">Richard S. Sutton and Andrew G. Barto</span>, 
<span class="title">"Reinforcement Learning: An Introduction"</span>, MIT Press
, <span class="year">2018</span>.

</li><br>
<li id=Agarwal20b>
<span class="author">Alekh Agarwal and Nan Jiang and Sham M. Kakade and Wen Sun</span>, 
<span class="title">"Reinforcement Learning: Theory and Algorithms"</span>, Online Draft
, <span class="year">2020</span>.

</li><br>
<li id=Szepesvari10>
<span class="author">Csaba Szepesvari</span>, 
<span class="title">"Algorithms for Reinforcement Learning"</span>, Morgan and Claypool Publishers
, <span class="year">2010</span>.

</li><br>
</ol>
</section><p/>
</div>

<table style="width:100%;" pdf="no"><tr style="width:100%">
  <td style="width:33%;text-align:left;"><a class="previous_chapter" href=deep_perception.html>Previous Chapter</a></td>
  <td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
  <td style="width:33%;text-align:right;"><a class="next_chapter" href=drake.html>Next Chapter</a></td>
</tr></table>

<div id="footer" pdf="no">
  <hr>
  <table style="width:100%;">
    <tr><td><a href="https://accessibility.mit.edu/">Accessibility</a></td><td style="text-align:right">&copy; Russ
      Tedrake, 2023</td></tr>
  </table>
</div>


</body>
</html>
